A significant portion of the checking was done in the Excel file 'manual verification.'
In essence, I searched the state's site (https://apps.state.or.us/cf2/spd/facility_complaints/) using the same criteria as the scraper, then copy-pasted the resulting per-facility complaint totals into a spreadsheet. I summed them and compared the sums against what the scraper returned.
There were some differences between the two.
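The cross-check below boils down to: count the scraped complaints per facility, join those counts onto the manual totals, and keep the rows where the two disagree. A minimal sketch of that pattern with invented data (the column names `name`, `count`, `fac_name`, and `abuse_number` mirror the notebook; the facilities and values here are made up):

```python
import pandas as pd

# Toy stand-ins for the two sources (data invented for illustration).
manual = pd.DataFrame({'name': ['A HOME', 'B HOME'], 'count': [3, 2]})
scraped = pd.DataFrame({'fac_name': ['A HOME'] * 3 + ['B HOME'],
                        'abuse_number': ['X1', 'X2', 'X3', 'Y1']})

# Count scraped complaints per facility, then line the totals up.
per_fac = scraped.groupby('fac_name').count().reset_index()
check = manual.merge(per_fac, how='left', left_on='name', right_on='fac_name')

# Rows where the manual total disagrees with the scraped count.
mismatches = check[check['count'] != check['abuse_number']]
print(mismatches['name'].tolist())  # → ['B HOME']
```

Here 'B HOME' surfaces because the manual total says 2 complaints but only one scraped row exists for it, which is exactly the kind of discrepancy the cells below hunt for.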
In [5]:
import pandas as pd
import numpy as np
from IPython.display import display, HTML
display(HTML("<style>.container { width:100% !important; }</style>"))
In [6]:
scraped_comp = pd.read_csv('../data/scraped/scraped_complaints_3_25.csv')
In [7]:
scraped_comp['abuse_number'] = scraped_comp['abuse_number'].str.upper()
In [8]:
manual = pd.read_excel('/Users/fzarkhin/OneDrive - Advance Central Services, Inc/fproj/github/database-story/scraper/manual verification.xlsx', sheet_name='All manual')
In [9]:
manual = manual.groupby('name').sum().reset_index()
In [10]:
manual['name'] = manual['name'].str.strip()
scraped_comp['fac_name'] = scraped_comp['fac_name'].str.strip()
In [11]:
df = scraped_comp.groupby('fac_name').count().reset_index()[['fac_name','abuse_number']]
In [12]:
merge1 = manual.merge(df, how='left', left_on='name', right_on='fac_name')
Five facilities' totals did not match. Manual checks show the online data is inaccurate for these.
In [13]:
merge1[merge1['count']!=merge1['abuse_number']].sort_values('abuse_number')#.sum()
Out[13]:
In [14]:
manual[manual['name']=='AVAMERE AT SANDY']
Out[14]:
In [15]:
scraped_comp[scraped_comp['abuse_number']=='BH116622B']
Out[15]:
In [16]:
scraped_comp[scraped_comp['fac_name'].str.contains('FLAGSTONE RETIREME')]
Out[16]:
In [17]:
merge2 = manual.merge(df, how='right', left_on='name', right_on='fac_name')
In [18]:
merge2[merge2['count']!=merge2['abuse_number']].sort_values('count')#.sum()
Out[18]:
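The two one-sided merges above (left, then right) catch facilities missing from either source in separate passes. As a usage note: a single outer merge with `indicator=True` flags both directions at once via the `_merge` column. A hedged sketch with invented facility names (not the notebook's real data):

```python
import pandas as pd

# Made-up examples: 'A HOME' exists only in the manual sheet,
# 'C HOME' only in the scraped data, 'B HOME' in both.
manual = pd.DataFrame({'name': ['A HOME', 'B HOME']})
scraped = pd.DataFrame({'fac_name': ['B HOME', 'C HOME']})

both = manual.merge(scraped, how='outer', left_on='name',
                    right_on='fac_name', indicator=True)

# '_merge' is 'left_only', 'right_only', or 'both' for each row,
# so one filter surfaces every facility present in only one source.
print(both[both['_merge'] != 'both'])
```

This would print one `left_only` row ('A HOME') and one `right_only` row ('C HOME'), replacing the separate left- and right-merge passes with a single check.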